Take home 3 - question 1
Junseok Kim
Jun 17, 2023
pacman::p_load(jsonlite, tidygraph, ggraph,
               visNetwork, graphlayouts, ggforce,
               skimr, tidytext, tidyverse, ggplot2, tm)
### Check missing values
# A tibble: 0 × 4
# ℹ 4 variables: source <chr>, target <chr>, type <chr>, weights <int>
# A tibble: 27,622 × 5
id country type revenue_omu product_services
<chr> <chr> <chr> <dbl> <chr>
1 Jones LLC ZH Comp… 310612303. Automobiles
2 Coleman, Hall and Lopez ZH Comp… 162734684. Passenger cars,…
3 Aqua Advancements Sashimi SE Expr… Oceanus Comp… 115004667. Holding firm wh…
4 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca…
5 Taylor, Taylor and Farrell ZH Comp… 81466667. Fully electric …
6 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm…
7 Punjab s Marine conservation Riodel… Comp… 72167572. Beef, pork, chi…
8 Assam Limited Liability Company Utopor… Comp… 72162317. Power and Gas s…
9 Ianira Starfish Sagl Import Rio Is… Comp… 68832979. Light commercia…
10 Moran, Lewis and Jimenez ZH Comp… 65592906. Automobiles, tr…
# ℹ 27,612 more rows
21515 missing from revenue_omu column
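The missing-value count above can be reproduced for every column at once. The following is a minimal sketch on a made-up table (`nodes_demo` and its values are illustrative assumptions, not the challenge data):

```r
library(dplyr)
library(tibble)

# Toy stand-in for the nodes table; the values are invented for illustration.
nodes_demo <- tibble(
  id = c("A Ltd", "B Inc", "C LLC"),
  country = c("ZH", NA, "Oceanus"),
  revenue_omu = c(1200.5, NA, NA)
)

# Count the NA values in every column in a single pass.
na_counts <- nodes_demo %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Applied to the real node table, the same call would report one count per column, including the one for `revenue_omu`.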
# A tibble: 2,595 × 5
id country type revenue_omu product_services
<chr> <chr> <chr> <dbl> <chr>
1 Smith Ltd ZH Company NA Unknown
2 Williams LLC ZH Company NA Unknown
3 Garcia Inc ZH Company NA Unknown
4 Walker and Sons ZH Company NA Unknown
5 Walker and Sons ZH Company NA Unknown
6 Smith LLC ZH Company NA Unknown
7 Smith Ltd ZH Company NA Unknown
8 Romero Inc ZH Company NA Unknown
9 Niger River Marine life Oceanus Company NA Unknown
10 Coastal Crusaders AS Industrial Oceanus Company NA Unknown
# ℹ 2,585 more rows
# A tibble: 25,027 × 5
id country type revenue_omu product_services
<chr> <chr> <chr> <dbl> <chr>
1 Jones LLC ZH Comp… 310612303. Automobiles
2 Coleman, Hall and Lopez ZH Comp… 162734684. Passenger cars,…
3 Aqua Advancements Sashimi SE Expr… Oceanus Comp… 115004667. Holding firm wh…
4 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca…
5 Taylor, Taylor and Farrell ZH Comp… 81466667. Fully electric …
6 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm…
7 Punjab s Marine conservation Riodel… Comp… 72167572. Beef, pork, chi…
8 Assam Limited Liability Company Utopor… Comp… 72162317. Power and Gas s…
9 Ianira Starfish Sagl Import Rio Is… Comp… 68832979. Light commercia…
10 Moran, Lewis and Jimenez ZH Comp… 65592906. Automobiles, tr…
# ℹ 25,017 more rows
# A tibble: 10 × 2
Products Occurrences
<chr> <int>
1 character(0) 16395
2 Unknown 4614
3 Fish and seafood products 63
4 Seafood products 55
5 Fish and fish products 31
6 Food products 31
7 Canning, processing and manufacturing of seafood and other aquat… 23
8 Footwear 21
9 Seafood 20
10 Grocery products 19
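The two most frequent values in the table above, `character(0)` and `Unknown`, are placeholders rather than real product descriptions. One possible way to treat them as missing is sketched below on toy data; the table `nodes_demo` and the choice of placeholders are assumptions made for illustration.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the nodes table.
nodes_demo <- tibble(
  id = c("A Ltd", "B Inc", "C LLC", "D SE"),
  product_services = c("character(0)", "Unknown", "Seafood products", "Footwear")
)

# Strings that carry no product information (assumed list).
placeholders <- c("character(0)", "Unknown")

# Replace the placeholder strings with NA so they no longer count as products.
nodes_recoded <- nodes_demo %>%
  mutate(product_services = if_else(product_services %in% placeholders,
                                    NA_character_, product_services))

nodes_recoded %>% count(product_services, sort = TRUE)
```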
::: {.cell}
```{.r .cell-code}
skim(mc3_edges)
```
| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
:::
mc3_graph %>%
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  # constant alpha belongs outside aes()
  geom_edge_link(alpha = 0.5) +
  # node points are scaled with `size`; fixed colour and alpha are set outside aes()
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph() +
  theme(text = element_text(family = "sans"))
| Name | mc3_nodes1 |
| Number of rows | 37324 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 6 | 700 | 0 | 34121 | 0 |
| country | 29241 | 0.22 | 2 | 14 | 0 | 78 | 0 |
| type | 29241 | 0.22 | 7 | 16 | 0 | 3 | 0 |
| product_services | 29241 | 0.22 | 4 | 1737 | 0 | 1844 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 34139 | 0.09 | 939014 | 12435469 | 3652.23 | 8261.03 | 16966.67 | 48266.67 | 310612303 | ▇▁▁▁▁ |
## Text Sensing with tidytext
# A tibble: 37,324 × 6
id country type revenue_omu product_services n_fish
<chr> <chr> <chr> <dbl> <chr> <int>
1 Gvardeysk Sextant ОАО Cargo Uziland Comp… 73027. Fish salads (It… 11
2 Taylor LLC ZH Comp… 138982. Fish (anchovy, … 11
3 SeaSelect Foods Salt spray Marebak Comp… 41902. European whole … 7
4 Arunachal Pradesh s S.A. d… Marebak Comp… 60346. Offers a wide r… 6
5 suō yú Ltd. Liability Co Coralm… Comp… 31567. Offers a wide r… 6
6 Estrella del Mar SE Riodel… Comp… 12565. Whole fresh fis… 5
7 Mar de la Vida S.p.A. Expr… Gavano… Comp… 97490. Fresh and froze… 5
8 Moore LLC ZH Comp… 61273. Fish and fish p… 5
9 Banded tilapia Corporation… Solova… Comp… 17498. Specialises in … 4
10 Costa de la Felicidad Ltd Puerto… Comp… NA Diverse range o… 4
# ℹ 37,314 more rows
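A column such as `n_fish` can be derived by counting how often a keyword occurs in each description. The sketch below uses toy data and assumes a case-insensitive match; whether the original count ignored case is not stated in the source.

```r
library(dplyr)
library(tibble)
library(stringr)

# Toy stand-in for the nodes table.
nodes_demo <- tibble(
  id = c("A Ltd", "B Inc", "C LLC"),
  product_services = c("Fish and fish products", "Footwear", NA)
)

# Count occurrences of "fish" per row, ignoring case, and sort descending.
nodes_fish <- nodes_demo %>%
  mutate(n_fish = str_count(product_services, regex("fish", ignore_case = TRUE))) %>%
  arrange(desc(n_fish))

nodes_fish
```

Rows with a missing description get `NA` rather than 0, and `arrange()` places them last.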

stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  geom_text(aes(label = n), vjust = 0.5, hjust = -0.1, size = 2, color = "black") +
  coord_flip() +
  # x holds the words and y the counts; coord_flip() only swaps where they are drawn
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Top 15 unique words found in product_services field") +
  theme_minimal()
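The plot above starts from a table with one word per row and stop words already removed. How `stopwords_removed` was built is not shown here; a minimal sketch of the usual tidytext approach, on toy data, could look like this:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy stand-in for the nodes table.
nodes_demo <- tibble(
  id = c("A Ltd", "B Inc"),
  product_services = c("Fresh and frozen seafood", "Canning of fish and seafood")
)

# Split each description into lower-case words, then drop common stop words
# using the stop_words table shipped with tidytext.
tokens_demo <- nodes_demo %>%
  unnest_tokens(word, product_services) %>%
  anti_join(stop_words, by = "word")

# Word frequencies, most common first.
word_counts <- tokens_demo %>% count(word, sort = TRUE)

word_counts
```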
| Name | mc3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
stopwords_removed_unique %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  geom_text(aes(label = n), vjust = 0.5, hjust = -0.1, size = 2, color = "black") +
  coord_flip() +
  # x holds the words and y the counts; coord_flip() only swaps where they are drawn
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Top 15 unique words found in product_services field") +
  theme_minimal()
words_fishery <- c("fish", "seafood", "frozen", "food", "fresh", "salmon", "shrimp", "shellfish", "sea", "squid", "water", "seafoods", "foods", "marine", "shipment", "shipping", "pier", "carp", "cod", "herring", "lichen", "mackerel", "pollock", "shark", "tuna", "ocean", "oyster", "clam", "lobster", "crab", "crustaceans", "crustacean", "bass")
mc3_nodes_fishery <- mc3_nodes_unique %>%
  # keep rows whose description contains any keyword, plus rows with no description
  filter(str_detect(product_services, paste(words_fishery, collapse = "|")) | is.na(product_services))
print(mc3_nodes_fishery)
# A tibble: 1,534 × 5
id country type revenue_omu product_services
<chr> <chr> <chr> <dbl> <chr>
1 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca…
2 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm…
3 Punjab s Marine conservation Riodel… Comp… 72167572. Beef, pork, chi…
4 Fisher Group ZH Comp… 29981457. Steel (marketin…
5 Morales, Young and Taylor ZH Comp… 23739782. Processed foods…
6 Morgan LLC ZH Comp… 17939781. Animal feed, an…
7 Neptune's Harvest LC Transport Riodel… Comp… 8726579. Frozen whole an…
8 Victoria Falls Limited Liabilit… Rio Is… Comp… 8014806. Domestic and in…
9 Caracola del Mar NV Family Rio Is… Comp… 7085566. Canned, frozen …
10 The Sea Lion NV Marine biology Oceanus Comp… 6264744. One- to five-da…
# ℹ 1,524 more rows
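Some rows in the result above, such as "Car service, ca…" or "Steel (marketin…", do not look fishery-related at first glance. One possible cause, stated here as an assumption because the full descriptions are truncated, is that a keyword joined with `|` also matches inside longer words. The sketch below contrasts plain substring matching with word-boundary matching on invented descriptions.

```r
library(stringr)

# A few keywords and invented descriptions, for illustration only.
words_demo <- c("sea", "cod", "fish")
products_demo <- c("Car service and research", "Frozen cod fillets",
                   "Barcode scanners", "Footwear")

# Loose pattern: matches the keywords anywhere, even inside other words.
loose <- paste(words_demo, collapse = "|")

# Strict pattern: the same keywords, but only as whole words.
strict <- paste0("\\b(", loose, ")\\b")

loose_hits <- str_detect(products_demo, loose)
strict_hits <- str_detect(products_demo, regex(strict, ignore_case = TRUE))

data.frame(products_demo, loose_hits, strict_hits)
```

With the loose pattern, "research" matches `sea` and "Barcode" matches `cod`; the strict pattern keeps only the cod fillets row.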
| Name | mc3_edges_new |
| Number of rows | 3711 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 7 | 213 | 0 | 1493 | 0 |
| target | 0 | 1 | 6 | 27 | 0 | 2887 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
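# Step 1: find rows whose source is a grouped value of the form c("...", "...")
mc3_edges_new_filtered <- mc3_edges_new %>%
  filter(startsWith(source, "c("))
# Step 2: split grouped sources into one row per company and strip the c( ) wrapper and quotes
mc3_edges_new_split <- mc3_edges_new_filtered %>%
  separate_rows(source, sep = ", ") %>%
  mutate(source = gsub('^c\\(|"|\\)$', '', source))
# Step 3: remove the grouped rows from the original edge list
mc3_edges_new2 <- mc3_edges_new %>%
  anti_join(mc3_edges_new_filtered, by = c("source", "target", "type", "weights"))
# Step 4: add the rows from step 2. mc3_edges_new is bound in again here,
# so every ungrouped edge appears at least twice in mc3_edges_new2.
mc3_edges_new2 <- mc3_edges_new2 %>%
  bind_rows(mc3_edges_new, mc3_edges_new_split)
# Step 5: count rows per (source, target, type) and keep combinations seen more than once
mc3_edges_new_groupby <- mc3_edges_new2 %>%
  group_by(source, target, type) %>%
  summarize(weight = n()) %>%
  filter(weight > 1) %>%
  ungroup()
| Name | mc3_edges_new_groupby |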
| Number of rows | 3703 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 7 | 57 | 0 | 1485 | 0 |
| target | 0 | 1 | 6 | 27 | 0 | 2887 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weight | 0 | 1 | 2 | 0.09 | 2 | 2 | 2 | 2 | 5 | ▇▁▁▁▁ |
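In the summary above the smallest weight is 2, which is consistent with `bind_rows()` receiving `mc3_edges_new` a second time, so that each ungrouped edge is counted twice. As an alternative, and only as a sketch on toy data rather than the method used here, the grouped sources can be split in place and the edges counted once. The table `edges_demo` and the separator `", "` between quoted names are assumptions.

```r
library(dplyr)
library(tibble)
library(tidyr)
library(stringr)

# Toy edge list: one grouped source, plus one edge that occurs twice.
edges_demo <- tibble(
  source = c('c("Alpha Ltd", "Beta Inc")', "Gamma LLC", "Gamma LLC"),
  target = c("Jane Doe", "John Roe", "John Roe"),
  type = "Beneficial Owner",
  weights = 1L
)

edges_clean <- edges_demo %>%
  # Split only on the quote-comma-quote boundary between grouped names,
  # so commas inside a company name are left alone.
  separate_rows(source, sep = '", "') %>%
  # Strip the leading c(, the quotes, and the trailing ).
  mutate(source = str_remove_all(source, '^c\\(|"|\\)$')) %>%
  # One row per distinct edge; weight is the number of times it occurred.
  count(source, target, type, name = "weight")

edges_clean
```

Rows without a grouped source pass through `separate_rows()` unchanged, so no separate filter and re-bind step is needed.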
#link the latest edges to the nodes
# sources that appear in the edges but not in the node table
source_missing <- setdiff(mc3_edges_new_groupby$source, mc3_nodes_fishery_new$id)
source_missing_df <- tibble(
  id = source_missing,
  country = rep(NA_character_, length(source_missing)),
  type = rep("Company", length(source_missing)),
  # named revenue_omu so the column lines up with the node table
  revenue_omu = rep(NA_real_, length(source_missing)),
  product_services = rep(NA_character_, length(source_missing))
)
# targets that appear in the edges but not in the node table
target_missing <- setdiff(mc3_edges_new_groupby$target, mc3_nodes_fishery_new$id)
target_missing_df <- tibble(
  id = target_missing,
  country = rep(NA_character_, length(target_missing)),
  type = rep("Company", length(target_missing)),
  revenue_omu = rep(NA_real_, length(target_missing)),
  product_services = rep(NA_character_, length(target_missing))
)
# collapse duplicated ids into one row, keeping up to three types per id
mc3_nodes_fishery_grouped <- mc3_nodes_fishery_new_df %>%
  group_by(id) %>%
  summarize(
    count = n(),
    type_1 = ifelse(n() >= 1, type[1], NA_character_),
    type_2 = ifelse(n() >= 2, type[2], NA_character_),
    type_3 = ifelse(n() >= 3, type[3], NA_character_),
    country = ifelse(n() == 1, country, paste(unique(country), collapse = ", ")),
    # converted to character so single-row and multi-row groups return the same type
    revenue_omu = ifelse(n() == 1, as.character(revenue_omu), paste(unique(revenue_omu), collapse = ", ")),
    product_services = ifelse(n() == 1, product_services, paste(unique(product_services), collapse = ", "))
  )
# A tibble: 10 × 2
Products Occurrences
<chr> <int>
1 <NA> 3560
2 Fish and seafood products 37
3 Seafood products 23
4 Canning, processing and manufacturing of seafood and other aquat… 18
5 Fish and fish products 15
6 Seafood 11
7 Fish and sea food products 9
8 Fish and seafoods products 9
9 Fresh and frozen seafood 9
10 Tuna, sword fish, bass, trout, and salmon, as well as offers she… 9
mc3_fish_graph <- tbl_graph(nodes = mc3_nodes_fishery_,
                            edges = mc3_edges_fishery,
                            directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness()) %>%
  ggraph(layout = "nicely") +
  scale_edge_width(range = c(0.01, 6)) +
  geom_node_point(aes(colour = type_1,
                      size = betweenness_centrality)) +
  theme_graph() +
  labs(size = "Betweenness Centrality")
mc3_fish_graph
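Betweenness centrality, used above to size the nodes, measures how often a node lies on the shortest paths between other nodes. A small sketch on a four-node chain (toy data, not the challenge graph) shows how the values behave:

```r
library(tidygraph)
library(dplyr)
library(tibble)

# A simple chain A - B - C - D.
nodes_demo <- tibble(name = c("A", "B", "C", "D"))
edges_demo <- tibble(from = c("A", "B", "C"), to = c("B", "C", "D"))

g <- tbl_graph(nodes = nodes_demo, edges = edges_demo, directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness())

# Extract the node table and read off the scores.
bc <- g %>% as_tibble() %>% pull(betweenness_centrality)

bc
```

The two inner nodes each sit on two shortest paths, while the end nodes sit on none, so only B and C get a non-zero score.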
mc3_edges_fishery_in <- mc3_edges_fishery %>%
  rename(from = source, to = target)
mc3_nodes_fishery_in <- mc3_nodes_fishery_ %>%
  rename(group = type_1)
# Create a visNetwork object with nodes and edges
visNetwork(nodes = mc3_nodes_fishery_in, edges = mc3_edges_fishery_in) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)
# A tibble: 1,324 × 1
# Groups: target [508]
target
<chr>
1 Elizabeth Jones
2 Michael Morrison
3 Amanda Robinson
4 Andrew Taylor
5 Brandon Cruz
6 Michael Thompson
7 Melissa Martin
8 Christopher Ramos
9 Richard Smith
10 Andrew Reed
# ℹ 1,314 more rows
# A tibble: 2,887 × 2
target count
<chr> <int>
1 Michael Johnson 11
2 John Smith 10
3 Brian Smith 8
4 Jennifer Johnson 8
5 Michael Smith 8
6 Richard Smith 8
7 David Smith 7
8 James Brown 7
9 James Smith 7
10 Melissa Brown 7
# ℹ 2,877 more rows
# number of edges per target, largest first
filtered_data <- mc3_edges_fishery %>%
  group_by(target) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  select(target, count)
# how many targets have each count, with the share as a label
filtered_data %>%
  group_by(count) %>%
  summarise(n = n()) %>%
  mutate(percentage = round(n/sum(n) * 100, 2)) %>%
  ggplot(aes(x = count, y = n, label = paste0(round(percentage, 2), "%"))) +
  geom_bar(stat = "identity", fill = "lightblue", color = "black") +
  xlab("Count") +
  ylab("n") +
  scale_x_continuous(breaks = unique(filtered_data$count)) +
  geom_text(position = position_stack(vjust = 0.5))
# same as above, restricted to targets linked to more than one source
filtered_data2 <- mc3_edges_fishery %>%
  group_by(target) %>%
  summarise(count = n()) %>%
  filter(count > 1) %>%
  arrange(desc(count)) %>%
  select(target, count)
filtered_data2 %>%
  group_by(count) %>%
  summarise(n = n()) %>%
  mutate(percentage = round(n/sum(n) * 100, 2)) %>%
  ggplot(aes(x = count, y = n, label = paste0(round(percentage, 2), "%"))) +
  geom_bar(stat = "identity", fill = "lightblue", color = "black") +
  xlab("Count") +
  ylab("n") +
  # take the axis breaks from the filtered table that is being plotted
  scale_x_continuous(breaks = unique(filtered_data2$count)) +
  geom_text(position = position_stack(vjust = 0.5))
Through data exploration, I observed two anomalies.
First, some individuals have ownership in multiple companies. In the three sub-network graphs analysed, these individuals tend to own a combination of large and small firms from various countries. While everything may be legitimate, it would be beneficial for FishEye to examine more thoroughly the individuals who own companies across borders, particularly when they are the sole owners of smaller entities, as exemplified by ‘James Brown’.
Second, every id in the fishery node table has type “Company”; surprisingly, there are no beneficial owners or company contacts. This was somewhat expected given the initial distribution of types in the node data frame, but it may be worth investigating further what the id type implies for illegal fishing.
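To follow up on individuals who own companies across borders, the edge list can be joined to the node table and summarised per individual. This is a minimal sketch on invented data; the tables `edges_demo` and `nodes_demo` and the thresholds are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Toy ownership edges: source is the company, target is the individual.
edges_demo <- tibble(
  source = c("A Ltd", "B Inc", "C LLC", "D SE"),
  target = c("Owner 1", "Owner 1", "Owner 2", "Owner 2")
)

# Toy node table with the country of each company.
nodes_demo <- tibble(
  id = c("A Ltd", "B Inc", "C LLC", "D SE"),
  country = c("ZH", "Oceanus", "ZH", "ZH")
)

# Attach the country to each edge, then count companies and countries per individual.
cross_border <- edges_demo %>%
  left_join(nodes_demo, by = c("source" = "id")) %>%
  group_by(target) %>%
  summarise(n_companies = n_distinct(source),
            n_countries = n_distinct(country, na.rm = TRUE),
            .groups = "drop") %>%
  # keep individuals linked to several companies in more than one country
  filter(n_companies > 1, n_countries > 1)

cross_border
```

In this toy data only "Owner 1" is flagged, since "Owner 2" holds two companies in the same country.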